Method of generating a maximum entropy speech model



The invention relates to a method of generating a maximum entropy speech 
model for a speech recognition system. 

When speech models are generated for speech recognition systems, there is the 
problem that the training corpora contain only limited quantities of training material. 
Probabilities of speech utterances that are only derived from the respective rates of 
occurrence in the training corpus are therefore subjected to smoothing procedures, for 
example, by backing-off techniques. However, backing-off speech models generally do not 
optimally utilize available training data, because unseen histories of N-grams are only 
compensated in that the respectively considered N-gram is shortened until a non-zero rate of 
occurrence in the training corpus is obtained. With maximum entropy speech models this 
problem may be counteracted (compare R. Rosenfeld, "A maximum entropy approach to
adaptive statistical language modeling", Computer Speech and Language, 1996, pp. 187-228).
By means of such speech models both rates of occurrence of N-grams and gap N-grams
in the training corpus can be used for the estimation of speech model probabilities, which is 
not the case with backing-off speech models. However, during the generation of a maximum 
entropy speech model the problem occurs that suitable boundary values are to be estimated 
on whose selection the iterated speech model values of the maximum entropy speech model 
depend. The speech model probabilities p_λ(w | h) of such a speech model (w: vocabulary
element; h: history of vocabulary elements relative to w) are determined during training
so that they satisfy, as well as possible, boundary value equations of the form

$$ m_\alpha = \sum_{(h,w)} p_\lambda(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) $$

Here m_α is the boundary value for a condition α that is set a priori; the filter function
f_α(h, w) takes the value one if the condition is satisfied and the value zero otherwise. A
condition α tests whether a considered sequence (h, w) of vocabulary elements is a certain
N-gram (the term N-gram also includes gap N-grams), or ends in a certain N-gram (N > 1).
N-gram elements may also be classes containing vocabulary elements that have a
special relation to each other. N(h) denotes the rate of occurrence of the history h in the
training corpus.
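
The filter functions described above can be pictured as small predicates over (h, w). The following is a minimal sketch, assuming a history is a tuple of words and a class is given by a word-to-class dictionary; all function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

History = Tuple[str, ...]
Filter = Callable[[History, str], int]


def ends_in_ngram(ngram: Sequence[str]) -> Filter:
    """f(h, w) = 1 if the sequence (h, w) ends in the given N-gram, else 0."""
    target = tuple(ngram)

    def f(h: History, w: str) -> int:
        seq = h + (w,)
        return int(len(seq) >= len(target) and seq[-len(target):] == target)

    return f


def ends_in_gap_bigram(u: str, w_target: str, gap: int = 1) -> Filter:
    """f(h, w) = 1 if w == w_target and u stands exactly `gap` words before w.

    Assumption: the gap has a fixed length; the text also allows longer gaps.
    """

    def f(h: History, w: str) -> int:
        return int(w == w_target and len(h) > gap and h[-(gap + 1)] == u)

    return f


def ends_in_class(word_class: Dict[str, str], cls: str) -> Filter:
    """f(h, w) = 1 if the word w belongs to the class cls, else 0."""

    def f(h: History, w: str) -> int:
        return int(word_class.get(w) == cls)

    return f
```

Each returned function takes a history and a word and yields one or zero, so a list of such functions can stand in for the set of conditions α.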

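From all the probability distributions that satisfy the boundary value equations,
the distribution that maximizes the specific entropy

$$ -\sum_{h} N(h) \sum_{w} p_\lambda(w \mid h) \log p_\lambda(w \mid h) $$

is selected for the maximum entropy modeling. This special distribution has the form

$$ p_\lambda(w \mid h) = \frac{1}{Z_\lambda(h)} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h, w) \Big\}, \qquad Z_\lambda(h) = \sum_{v} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h, v) \Big\} $$

with suitable parameters λ_α.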
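
Why the maximizing distribution is exponential in the filter functions can be seen from a short Lagrange multiplier argument. The sketch below assumes that the entropy is the conditional entropy weighted by N(h) with N(h) > 0, uses multipliers λ_α for the boundary value equations, and introduces a helper multiplier μ_h for the normalization of each history; μ_h is not a symbol from the text.

```latex
\begin{aligned}
L &= -\sum_h N(h) \sum_w p(w \mid h) \log p(w \mid h)
     + \sum_\alpha \lambda_\alpha \Big( \sum_{(h,w)} p(w \mid h)\, N(h)\, f_\alpha(h,w) - m_\alpha \Big)
     + \sum_h \mu_h \Big( \sum_w p(w \mid h) - 1 \Big) \\
0 &= \frac{\partial L}{\partial p(w \mid h)}
   = -N(h) \big( \log p(w \mid h) + 1 \big) + N(h) \sum_\alpha \lambda_\alpha f_\alpha(h,w) + \mu_h \\
\Rightarrow \quad p(w \mid h)
  &= \exp\Big\{ \frac{\mu_h}{N(h)} - 1 \Big\} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h,w) \Big\}
   = \frac{1}{Z_\lambda(h)} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h,w) \Big\}
\end{aligned}
```

The last step uses the normalization over w, which fixes the history-dependent factor to 1 / Z_λ(h).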

For the iteration of a maximum entropy speech model, the so-called GIS algorithm
(Generalized Iterative Scaling) in particular is used, whose basic structure is
described in J.N. Darroch, D. Ratcliff: "Generalized iterative scaling for log-linear models",
The Annals of Mathematical Statistics, 43(5), pp. 1470-1480, 1972. One approach to
determining the boundary values m_α is based, for example, on maximizing the
probability of the training corpus used, which leads to boundary values m_α = N(α), i.e. it
is determined how often the conditions α are satisfied in the training corpus. This is
described, for example, in S.A. Della Pietra, V.J. Della Pietra, J. Lafferty, "Inducing
features of random fields", Technical report CMU-CS-95-144, 1995. These boundary values,
however, often force several speech model probability values p_λ(w | h) of the models
restricted by the boundary value equations to vanish (i.e. become zero), in particular
for sequences (h, w) not seen in the training corpus. Vanishing speech model probability
values p_λ(w | h) are to be avoided for two reasons. First, a speech
recognition system could in such cases not recognize sentences containing the word sequence
(h, w), even if they were plausible recognition results, merely because they do not appear in
the training corpus. Second, values p_λ(w | h) = 0 contradict the functional form of
the solution given by the above equation for p_λ(w | h) as long as the parameters λ_α are
limited to finite values. This so-called inconsistency (compare J.N. Darroch, D. Ratcliff
mentioned above) prevents the solution of the boundary value equations with all the training
methods known so far.
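
The second reason can be made concrete with a toy calculation. The sketch below assumes a single word whose only active filter function has the weight `lam`, competing with `n_other` words that have no active filter function; the setup and the name `p_unseen` are hypothetical.

```python
import math


def p_unseen(lam: float, n_other: int = 3) -> float:
    """Exponential-form probability of one word with a single active feature.

    The word scores exp(lam); each of the n_other competing words scores
    exp(0) = 1, so the normalization is exp(lam) + n_other.
    """
    return math.exp(lam) / (math.exp(lam) + n_other)


# The probability shrinks as lam decreases but stays strictly positive
# for every finite lam.
for lam in (0.0, -5.0, -20.0, -50.0):
    print(f"lambda = {lam:6.1f}  p = {p_unseen(lam):.3e}")
```

A boundary value of exactly zero would therefore require an infinitely negative parameter, which an iteration over finite values cannot reach.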

It is now the object of the invention to provide a method of generating
maximum entropy speech models, so that an improvement of the statistical properties of the
generated speech model is achieved.

The object is achieved in that:

by evaluating a training corpus, first probability values p_ind(w | h) are formed for N-grams
with N > 0;

an estimate of second probability values p_λ(w | h), which represent speech model values
of the maximum entropy speech model, is made in dependence on the first probability
values;

boundary values m_α are determined which correspond to the equation

$$ m_\alpha = \sum_{(h,w)} p_{ind}(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) $$

where N(h) is the rate of occurrence of the respective history h in the training corpus and
f_α(h, w) is a filter function which has a value different from zero for specific N-grams
predefined a priori and identified by the index α, and otherwise has the zero value;

an iteration of speech model values of the maximum entropy speech model is continued
until the values m_α^(n) determined in the n-th iteration step according to the formula

$$ m_\alpha^{(n)} = \sum_{(h,w)} p_\lambda^{(n)}(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) $$

approach the boundary values m_α sufficiently accurately in accordance with a
predefinable convergence criterion.

Forming a speech model in this manner leads to a speech model that
generalizes the statistics of the training corpus better to the statistics of the speech to be
recognized, in that the estimate of the probabilities p_λ(w | h) uses different statistics of the
training corpus for unseen word transitions (h, w): besides the N-grams having a shorter
range (as with backing-off speech models), it is also possible to take into account gap N-gram
statistics and correlations between word classes when the values p_λ(w | h) are estimated.

In particular, the GIS algorithm is used for the iteration of the speech model
values of the maximum entropy speech model, i.e. for the iterative training.
The first probability values p_ind(w | h) are preferably backing-off speech model
probability values.

The invention also relates to a speech recognition system with an accordingly 
structured speech model. 



Examples of embodiment of the invention will be further explained in the
following with reference to a drawing Figure.

The Figure shows a speech recognition system 1 whose input 2 is supplied
with speech signals in electrical form. A function block 3 represents the acoustic analysis,
which successively produces attribute vectors describing the speech signals
on the output 4. During the acoustic analysis the speech signals occurring in
electrical form are sampled and quantized and subsequently combined into frames. Successive
frames preferably partly overlap. For each frame an attribute vector is
formed. The function block 5 represents the search for the sequence of speech vocabulary
elements that is the most probable for the entered sequence of attribute vectors. As is
customary in speech recognition systems, the probability of the recognition result is
maximized with the aid of the so-called Bayes formula. Both an acoustic model of the speech
signals (function block 6) and a linguistic speech model (function block 7) play a role in the
processing according to function block 5. The acoustic model according to function block 6
implies the customary use of so-called HMM models (Hidden Markov Models) for the
modeling of individual vocabulary elements or of a combination of a plurality of
vocabulary elements. The speech model (function block 7) contains estimated probability
values for vocabulary elements or sequences of vocabulary elements. This is where
the invention, explained further below, comes in; it reduces the error rate of
the recognition result produced on the output 8. Furthermore, the complexity of
the system is reduced.

In the speech recognition system 1 according to the invention a speech model
having probability values p_λ(w | h), i.e. certain N-gram probabilities with N > 0, is used for
N-grams (h, w) (with h as the history of N-1 elements with respect to the vocabulary element
w), which is based on a maximum entropy estimate. The searched distribution is limited
by certain marginal distributions, and under these marginal conditions the maximum entropy
model is chosen. The marginal conditions may relate both to N-grams of different lengths (N
= 1, 2, 3, ...) and to gap N-grams, for example, to gap bigrams of the form (u, *, w), where *
is a placeholder for at least one arbitrary N-gram element between the elements u and w.
Similarly, N-gram elements may be classes C, which group vocabulary elements
that have a special relation to each other, for example, in that they show grammatical or
semantic relations.
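
Gap bigram statistics of the form (u, *, w) can be collected from a corpus in a few lines. The following is a minimal sketch, assuming the corpus is a plain list of tokens and that a gap of size k skips exactly k positions; the function name `gap_bigram_counts` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def gap_bigram_counts(tokens: List[str], gap: int = 1) -> Counter:
    """Count pairs (u, w) where exactly `gap` tokens lie between u and w.

    gap=0 yields ordinary bigrams; gap=1 yields gap bigrams (u, *, w).
    """
    pairs: List[Tuple[str, str]] = [
        (tokens[i], tokens[i + gap + 1]) for i in range(len(tokens) - gap - 1)
    ]
    return Counter(pairs)


tokens = "the big dog saw the small dog".split()
counts = gap_bigram_counts(tokens, gap=1)
print(counts[("the", "dog")])
```

In this toy sentence the pair ("the", "dog") never occurs as an ordinary bigram, yet the gap bigram statistics still record the relation between the two words.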

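The probabilities p_λ(w | h) are estimated in a training on the basis of a training
corpus (for example, NAB corpus - North American Business News) according to the
following formula:

$$ p_\lambda(w \mid h) = \frac{1}{Z_\lambda(h)} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h, w) \Big\} \quad \text{with} \quad Z_\lambda(h) = \sum_{v} \exp\Big\{ \sum_\alpha \lambda_\alpha f_\alpha(h, v) \Big\} \qquad (1) $$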
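
Formula (1) is a normalized exponential over the vocabulary. The following is a minimal sketch, assuming the filter functions are Python callables returning one or zero and the parameters are given as a list in the same order; the name `maxent_prob` is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

History = Tuple[str, ...]
Filter = Callable[[History, str], int]


def maxent_prob(
    h: History,
    vocab: Sequence[str],
    filters: List[Filter],
    lambdas: List[float],
) -> Dict[str, float]:
    """Return p_lambda(w | h) for every w in vocab according to formula (1)."""
    # Unnormalized score: exp of the weighted sum of active filter functions.
    scores = {
        w: math.exp(sum(lam * f(h, w) for f, lam in zip(filters, lambdas)))
        for w in vocab
    }
    z = sum(scores.values())  # Z_lambda(h), summed over the whole vocabulary
    return {w: s / z for w, s in scores.items()}
```

With all parameters equal to zero the function returns the uniform distribution, which is a convenient start value for an iterative training.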



The quality of the speech model thus formed is decisively determined
by the selection of the boundary values m_α on which the probability values p_λ(w | h) of the
speech model depend, which is expressed by the following formula:

$$ m_\alpha = \sum_{(h,w)} p_\lambda(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) \qquad (2) $$

The boundary values m_α are estimated by means of an already calculated and
available speech model having the speech model probabilities p_ind(w | h). Formula (2) is used
for this purpose, in which only p_λ(w | h) is replaced by p_ind(w | h), so that the m_α are
estimated in accordance with the formula

$$ m_\alpha = \sum_{(h,w)} p_{ind}(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) \qquad (3) $$



The values p_ind(w | h) are specifically probability values of a so-called
backing-off speech model determined on the basis of the training corpus (see, for example, R.
Kneser, H. Ney, "Improved backing-off for M-gram language modeling", ICASSP 1995, pp.
181-185). The values p_ind(w | h) may, however, also be taken from other (already calculated)
speech models assumed to be defined, as they are described, for example, in A. Nadas:
"Estimation of Probabilities in the Language Model of the IBM Speech Recognition System",
IEEE Trans. on Acoustics, Speech and Signal Proc., Vol. ASSP-32, pp. 859-861, Aug. 1984
and in S.M. Katz: "Estimation of Probabilities from Sparse Data for the Language Model
Component of a Speech Recognizer", IEEE Trans. on Acoustics, Speech and Signal Proc.,
Vol. ASSP-35, pp. 400-401, March 1987.
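
Formula (3) is a weighted sum over all pairs (h, w). The following is a minimal sketch, assuming that histories are tuples of words, that `hist_count` maps each history to its rate of occurrence N(h), and that `p_ind` is a callable returning an already smoothed probability; these names are hypothetical and no particular backing-off scheme is implemented.

```python
from typing import Callable, Dict, List, Sequence, Tuple

History = Tuple[str, ...]
Filter = Callable[[History, str], int]


def boundary_values(
    p_ind: Callable[[History, str], float],
    hist_count: Dict[History, int],
    filters: List[Filter],
    vocab: Sequence[str],
) -> List[float]:
    """m_alpha = sum over (h, w) of p_ind(w | h) * N(h) * f_alpha(h, w)."""
    m = [0.0] * len(filters)
    for h, n_h in hist_count.items():
        for w in vocab:
            p = p_ind(h, w)
            for a, f in enumerate(filters):
                m[a] += p * n_h * f(h, w)
    return m
```

Because a smoothed model assigns a positive probability to every word, each boundary value comes out positive as soon as its filter function is active for at least one pair (h, w) with N(h) > 0, including pairs never seen in the training corpus.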

N(h) indicates the rate of occurrence of the respective history h in the training corpus.
f_α(h, w) is a filter function corresponding to a condition α, which filter function has a value
different from zero (here the value one) if the condition α is satisfied, and is otherwise equal
to zero. The conditions α and the associated filter functions f_α are heuristically determined
for the respective training corpus. More particularly, a choice is made here for which word or
class N-grams or gap N-grams the boundary values are fixed.

Conditions α for which f_α(h, w) has the value one are preferably:

a considered N-gram ends in a certain vocabulary element w;

a considered N-gram (h, w) ends in a vocabulary element w which belongs to a certain
class C, which groups vocabulary elements that have a special relation to each other
(see above);

a considered N-gram (h, w) ends in a certain bigram (v, w) or a gap bigram (u, *, w) or a
specific trigram (u, v, w), etc.;

a considered N-gram (h, w) ends in a bigram (v, w) or a gap bigram (u, *, w), etc., where
the vocabulary elements u, v and w lie in certain predefined word classes C, D and E.

In addition to the derivation of all the boundary values m_α according to
equation (3) from a single predefined a priori speech model with probability values p_ind(w | h),
separate a priori speech models with probability values p_ind(w | h) can be predefined for
certain groups of conditions α; the boundary values according to equation
(3) are in this case calculated separately for each group from the associated a priori
speech model. Examples of possible groups are in particular:

word unigrams, word bigrams, word trigrams;

word gap-1 bigrams (with a gap corresponding to a single word);

word gap-2 bigrams (with a gap corresponding to two words);

class unigrams, class bigrams, class trigrams;

class gap-1 bigrams;

class gap-2 bigrams.

The speech model parameters λ_α are determined here with the aid of the GIS
algorithm whose basic structure was described, for example, by J.N. Darroch, D. Ratcliff. A
value M with

$$ M = \max_{(h,w)} \Big\{ \sum_\alpha f_\alpha(h, w) \Big\} \qquad (4) $$

is then estimated. Furthermore, N stands for the size of the training corpus used,
i.e. the number of vocabulary elements the training corpus contains. The GIS algorithm
used may then be described as follows:

Step 1: Start with any start value p_λ^(0)(w | h).

Step 2: Updating of the boundary values in the n-th pass through the iteration loop:

$$ m_\alpha^{(n)} = \sum_{(h,w)} p_\lambda^{(n)}(w \mid h) \cdot N(h) \cdot f_\alpha(h, w) \qquad (5) $$

where p_λ^(n)(w | h) is calculated from the parameters λ_α^(n) determined in step 3 by
insertion into formula (1).



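Step 3: Updating of the parameters λ_α:

$$ \lambda_\alpha^{(n+1)} = \lambda_\alpha^{(n)} + \frac{1}{M} \log\frac{m_\alpha}{m_\alpha^{(n)}} - \frac{1}{M} \log\frac{M \cdot N - \sum_\beta m_\beta}{M \cdot N - \sum_\beta m_\beta^{(n)}} \qquad (6) $$

where the term subtracted last is dropped if the following holds for M:

$$ M = \sum_\beta f_\beta(h, w) \quad \forall (h, w) \qquad (7) $$

m_α or m_β (β is only another running variable) are the boundary values estimated according to
formula (3) on the basis of the probability values p_ind(w | h).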



Step 4: Continuation of the algorithm with step 2 until convergence of the algorithm.

Convergence of the algorithm is understood to mean that the magnitude of the
difference between the estimated m_α of formula (3) and the iterated value m_α^(n) is smaller
than a predefinable and sufficiently small limit value ε.
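
Steps 1 to 4 can be sketched in a few lines of code. The following is a minimal illustration, not an optimized implementation; it assumes that N equals the sum of all N(h), that every filter function is active for at least one pair (h, w) with N(h) > 0, and that all target boundary values are positive. The names `model_probs` and `gis` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

History = Tuple[str, ...]
Filter = Callable[[History, str], int]


def model_probs(h: History, vocab: Sequence[str], filters: List[Filter],
                lambdas: List[float]) -> List[float]:
    """Formula (1): p_lambda(w | h) for every w, in the order of vocab."""
    scores = [math.exp(sum(lam * f(h, w) for f, lam in zip(filters, lambdas)))
              for w in vocab]
    z = sum(scores)
    return [s / z for s in scores]


def gis(vocab: Sequence[str], hist_count: Dict[History, int],
        filters: List[Filter], m: List[float],
        eps: float = 1e-8, max_iter: int = 10000) -> List[float]:
    """Iterate the parameters until the boundary values m are reproduced."""
    n_total = sum(hist_count.values())  # assumption: N = sum of N(h)
    active = [sum(f(h, w) for f in filters) for h in hist_count for w in vocab]
    big_m = max(active)                            # formula (4)
    has_slack = any(a != big_m for a in active)    # formula (7) violated?
    lambdas = [0.0] * len(filters)                 # step 1: uniform start
    for _ in range(max_iter):
        # Step 2: iterated boundary values, formula (5).
        m_n = [0.0] * len(filters)
        for h, n_h in hist_count.items():
            probs = model_probs(h, vocab, filters, lambdas)
            for w, p in zip(vocab, probs):
                for a, f in enumerate(filters):
                    m_n[a] += p * n_h * f(h, w)
        # Step 4: stop once every boundary value is matched within eps.
        if max(abs(t - c) for t, c in zip(m, m_n)) < eps:
            break
        # Step 3: parameter update, formula (6); the correction term is
        # dropped when formula (7) holds for all (h, w).
        corr = 0.0
        if has_slack:
            corr = math.log((big_m * n_total - sum(m))
                            / (big_m * n_total - sum(m_n))) / big_m
        lambdas = [lam + math.log(t / c) / big_m - corr
                   for lam, t, c in zip(lambdas, m, m_n)]
    return lambdas
```

The sketch recomputes all probabilities in every pass, which is only practical for toy vocabularies; it is meant to show the order of the steps and the role of M and N in the update.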

As an alternative to the use of the GIS algorithm, any method may be used
that calculates the maximum entropy solution for predefined boundary conditions, for
example, the Improved Iterative Scaling method which was described by S.A. Della Pietra,
V.J. Della Pietra, J. Lafferty (compare above).



